Abstract

The problem of Group Testing (GT) is to identify defective items out of a
set of objects by means of pool queries of the form "Does the pool contain
at least one defective?". The aim is of course to perform detection with the
fewest possible queries, a problem which has relevant practical applications
in different fields including molecular biology and computer science. Here
we study GT in the probabilistic setting focusing on the regime of small
defective probability and large number of objects, $p \to 0$ and $N \to \infty$.
We construct and analyze one-stage algorithms for which we establish the
occurrence of a non-detection/detection phase transition resulting in a sharp
threshold, $\overline{M}$, for the number of tests. By optimizing the pool design we
construct algorithms whose detection threshold follows the optimal scaling
$\overline{M} \propto Np|\log p|$. Then we consider two-stage algorithms and analyze their
performance for different choices of the first stage pools. In particular, via
a proper random choice of the pools, we construct algorithms which attain
the optimal value (previously determined in Ref. [16]) for the mean number
of tests required for complete detection. We finally discuss the optimal pool
design in the case of finite $p$.



1 Introduction 



The general problem of Group Testing (GT) is to identify defective items 
in a set of objects. Each object can be either defective or OK and we are 
allowed only to test groups of items via the query "Does the pool contain at
least one defective?". The aim is of course to perform detection in the most
efficient way, namely with the fewest possible number of tests.

Apart from the original motivation of performing efficient mass blood 
testing [1], GT has been also applied in a variety of situations in molecular 
biology: blood screening for HIV tests [2], screening of clone libraries [3,4], 
sequencing by hybridization [5,6]. Furthermore it has proved relevant for 
fields other than biology including quality control in product testing [7], 
searching files in storage systems [8], data compression [9] and more recently 
in the context of data gathering in sensor networks [10]. We refer to [11, 12] 
for reviews on the different applications of GT. 

The more abstract setting of GT is the following. We have N items and 
each one is associated with a binary random variable x which takes value 1 or 
0. We want to detect the value of all variables by performing tests on pools 
of variables. Each test corresponds to an OR function among the variables 
of the group, i.e. it returns a binary variable which equals 1 (respectively 
0) if at least one variable of the pool equals 1 (respectively if all variables 
are 0). Here we will only deal with this (very much studied) choice for the 
tests, often referred to as the gold-standard case. It is however important 
to keep in mind for future work that in many biological applications one 
should include the possibility of faulty OR tests [2, 13]. 
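
As a concrete illustration of the gold-standard OR test just described, the following minimal Python sketch draws a Bernoulli configuration and evaluates a few pools. The function names (`sample_items`, `run_tests`) and the example pools are illustrative assumptions, not part of the original formulation.

```python
import random


def sample_items(n_items, p, seed=None):
    """Draw i.i.d. Bernoulli(p) statuses: 1 = defective, 0 = OK."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n_items)]


def run_tests(x, pools):
    """Noiseless OR test: 1 if the pool holds at least one defective item."""
    return [int(any(x[i] for i in pool)) for pool in pools]


# Hand-picked example: item 1 is the only defective among four items.
x = [0, 1, 0, 0]
pools = [[0, 2], [1, 3], [0, 1, 2, 3]]
outcomes = run_tests(x, pools)  # the first pool is clean, the other two contain item 1
```

Faulty tests, mentioned above as a direction for future work, are deliberately not modelled here.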

Throughout our study we will focus on probabilistic GT in the Bernoulli
p-scheme, i.e. the situation in which the statuses of the items are i.i.d. random
variables which take value one with probability $p$ and zero with probability
$1-p$. In particular, we will be interested in constructing efficient detection
algorithms for this GT problem in the limit of large number of objects and
small defective probability, $N \to \infty$ and $p \to 0$.

In order to summarize our results we need first to introduce some termi- 
nology. The construction of any algorithm for GT involves two ingredients: 
the pool design (the choice of the groups over which tests are performed) 
and the inference procedure (how to detect the value of the items given the 
result of the tests). The pool design can be composed by one or more stages 
of parallel queries. For one-stage (or fully non-adaptive) algorithms all tests 
are specified in advance: the choice of the pools does not depend on the 
outcome of the tests. This would be in principle the easiest procedure for 
several biological applications. Indeed the test procedure can be destruc- 
tive for the objects and repeated tests on the same sample require more 
sophisticated techniques. However the number of tests required by fully 
non-adaptive algorithms can be much larger than for adaptive ones. The 
best compromise for most screening procedures [14] is therefore to consider
two-stage algorithms with a first stage containing a set of predetermined
pools (tested in parallel) and a second stage whose pools are chosen depend- 
ing on the outcomes of the first stage, i.e. after an inference procedure which 
uses the results of the first stage. Concerning the inference procedure, there 
exist both exact and approximate algorithms which lead after the last stage 
to detect the value of all variables with certainty or with high probability, 
respectively. 

Here we will construct one-stage approximate algorithms and two-stage
exact algorithms. In both cases the pool design for the first stage will involve
random pools and we will focus on the case $N \to \infty$ and $p \to 0$ with
$p = N^{-\beta}$ (the case $\beta = 0$ stands for $p \to 0$ after $N \to \infty$). This choice
was first discussed by Berger and Levenshtein in the two-stage setting in [15]
where they proved that for $\beta \in (0, 1)$ the minimal number of tests optimized
over all exact two-stage procedures, $T(N,p)$, is proportional to $Np|\log p|$.

In the one-stage case we will establish the occurrence of a phase transition:
considering two simple inference algorithms, we identify a threshold
$\overline{M}$ such that the probability of making at least one mistake in the detection
goes to one (respectively to zero) when $N \to \infty$ if the number of tests $M$
is below (respectively above) $\overline{M}$. By optimizing over the pool distribution,
we will construct algorithms for which the detection threshold shows the
optimal scaling $\overline{M} = (1-\beta)\beta^{-1}(\log 2)^{-2}\, Np|\log p|$.

Recently in Ref. [16] the value of the prefactor of $T$ has been determined
exactly when $\beta \in [0,1/2)$ for two-stage procedures. More precisely, the
authors have shown that $\lim_{N\to\infty} T/(Np|\log p|) = 1/(\log 2)^2$. Here we will
discuss the performance of two-stage algorithms for different choices of the 
first stage pool design. In particular we will show that the optimal value is 
obtained on random pools with a properly chosen fixed number of tests per 
variable and of variables per test (regular-regular case) and also when the 
number of tests per variable is fixed but the number of variables per test is 
Poisson distributed (regular-Poisson case). On the other hand we will show 
that this optimal value can never be attained in the Poisson-regular or in 
the Poisson-Poisson case. Finally, we discuss the optimal pool design when 
$N \to \infty$ and $p$ is held fixed.

The paper is organized as follows: in Sec. 2 we introduce the factor
graph representation of the problem in the most general case. In Sec. 3 we
describe the first simple inference procedure which allows us to identify the sure
variables. In Sec. 4 we analyze one-stage approximate algorithms, while in
Sec. 5 we turn to the two-stage exact setting. Finally, in Sec. 6 we give a
perspective of our work in view of applications.

Figure 1: Left: The bipartite graph corresponding to a single stage. Circles
(squares) represent variable (test) nodes. We depict by filled (empty)
squares the tests with outcome one (zero, respectively). Variables $i$ and $j$
are sure zeros, variable $k$ is a sure one, variables $l$, $m$ and $n$ are undetermined.
Right: The corresponding reduced graph where the sure variables ($i$, $j$, $k$)
and the strippable tests ($b$, $c$, $d$) have been erased. Note that one of the three
undetermined variables, $l$, is isolated.

2 Pool design: factor graph representation and 
random pools 

As we have explained, a GT algorithm can involve one or more stages of
parallel tests. The best way to define the pool design of each stage is in
terms of a factor graph representation. We build a graph with two types
of vertices: each variable is a vertex (variable node) and each test is also
a vertex (function node). Variable (function) nodes will be denoted by
indices $i, j, \ldots$ ($a, b, \ldots$) and depicted by a circle (square). Whenever a
variable $i$ belongs to test $a$ we set an edge between vertices $i$ and $a$. Thus if
$N$ is the overall number of items and $M$ the number of parallel tests in the
stage, we obtain a bipartite graph with $N$ variable nodes and $M$ test nodes
with edges between variables and tests only (in Fig. 1 we depict a case with
$N = 6$, $M = 4$).

We will denote by $\Lambda_i$ (by $P_i$) the fraction of variable nodes (function
nodes) of degree $i$ and use a practical representation of these degree profiles,
standard in coding theory, in terms of their generating functions $\Lambda(x) =
\sum_{n\ge 0}\Lambda_n x^n$ and $P(x) = \sum_{n\ge 0}P_n x^n$. The average variable node (resp.
function node) degree is given by $\sum_{n\ge 0}\Lambda_n n = \Lambda'(1)$ (resp. $\sum_{n\ge 0}P_n n =
P'(1)$).

Both in the one and two stage case we will use pool designs with a first
stage based on a randomly generated factor graph. We will consider different
possible distributions, but in all cases they will be uniform over the set of
graphs for a fixed choice of the degree profiles $\Lambda(x)$ and $P(x)$. Thus the
probabilities $\lambda_l$ (respectively $\rho_k$) that a randomly chosen edge in the graph is
adjacent to a variable node (resp. function node) of degree $l$ (degree $k$) are
given by

$$ \lambda_l = \frac{l\Lambda_l}{\sum_{l'} l'\Lambda_{l'}}, \qquad \rho_k = \frac{kP_k}{\sum_{k'} k'P_{k'}}, \qquad (1) $$

which are derived by noticing that the graph contains $Nl\Lambda_l$ (resp. $MkP_k$)
edges adjacent to variable nodes of degree $l$ (resp. function nodes of degree
$k$). We also define the edge perspective degree profiles as $\lambda[x] = \Lambda'[x]/\Lambda'(1)$
and $\rho[x] = P'[x]/P'(1)$, namely

$$ \lambda[x] = \sum_{l=1}^{l_{\max}}\lambda_l x^{l-1}, \qquad \rho[x] = \sum_{k=1}^{k_{\max}}\rho_k x^{k-1}. \qquad (2) $$

Note that the number of checks $M$ can also be written in terms of these
sequences, because the mean degree of variables, $\langle l\rangle = \Lambda'(1)$, and the mean
degree of tests, $\langle k\rangle = P'(1)$, are related by $N\langle l\rangle = M\langle k\rangle$. As $\Lambda'(x) =
\Lambda'(1)\lambda(x)$ and $\Lambda(1) = 1$ we get $\langle l\rangle = 1/\int_0^1\lambda(x)\,dx$, and similarly
$\langle k\rangle = 1/\int_0^1\rho(x)\,dx$. Therefore

$$ M = N\,\frac{\int_0^1\rho(x)\,dx}{\int_0^1\lambda(x)\,dx}. \qquad (3) $$
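
The passage from node-perspective to edge-perspective degree profiles, and the resulting relation between $N$ and $M$, can be checked numerically. The sketch below is only an illustration under stated assumptions: degree profiles are stored as dictionaries mapping a degree to the fraction of nodes with that degree, and the function names are hypothetical.

```python
from fractions import Fraction


def edge_perspective(node_profile):
    """Map {degree: fraction of nodes} to {degree: fraction of edges}."""
    mean_degree = sum(d * f for d, f in node_profile.items())
    return {d: d * f / mean_degree for d, f in node_profile.items() if d > 0}


def integral_edge_polynomial(edge_profile):
    """Integral over [0, 1] of sum_d e_d x^(d-1), i.e. sum_d e_d / d."""
    return sum(f / d for d, f in edge_profile.items())


def number_of_tests(n_items, variable_profile, test_profile):
    """M = N * (integral of rho) / (integral of lambda)."""
    lam = edge_perspective(variable_profile)
    rho = edge_perspective(test_profile)
    return n_items * integral_edge_polynomial(rho) / integral_edge_polynomial(lam)


# Half of the variables have degree 2, half have degree 4 (mean degree 3).
lam = edge_perspective({2: Fraction(1, 2), 4: Fraction(1, 2)})
# Regular case: every variable in 3 tests, every test on 6 variables.
m_tests = number_of_tests(100, {3: Fraction(1)}, {6: Fraction(1)})
```

Exact rational arithmetic is used so that the identity "integral of the edge polynomial equals the inverse mean degree" holds without rounding.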

3 First inference step: sure and isolated variables 

After the first stage of tests we will either use an inference procedure to
identify the result (in the one stage case) or choose a new set of pools
based on the outcomes of the previous tests (in the two stage case). In
our problem, the prior distribution of the variables, $x = (x_1, \ldots, x_N)$, is
Bernoulli: $B_p(x) = \prod_{i=1}^{N} p^{x_i}(1-p)^{1-x_i}$. Given the outputs of the tests, the
inference problem consists in finding the configuration $x$ which maximizes

$$ P(x) = B_p(x)\prod_{a=1}^{M}\mathbb{1}\big(T_a(x) = t_a\big). \qquad (4) $$

Here $t_a$ is the value of test $a$ and $T_a(x) = 0$ if $\sum_{i\in\mathcal{N}_a} x_i = 0$, $T_a(x) = 1$
otherwise, where $\mathcal{N}_a$ is the pool of variables connected to $a$.

Since the optimization of the above function is in general a very difficult
task, we start by checking whether some variables are identified with
certainty by the first stage and then try to extract information on the
remaining variables (see Fig. 1). The first observation is that in order for a
variable to be a sure zero it should belong to at least one test with outcome
zero. On the other hand in order to be a sure one it should belong to at least
one positive test in which all the other variables are sure zeros. Variables
that are neither sure zeros nor sure ones are the undetermined variables.
We start by noticing that if a test contains only zeros, or if it contains at
least one sure one, then it does not carry any information on the undetermined
variables. We call such a test strippable, as is the case for tests $b$, $c$, $d$ in Fig. 1.
It is then immediate to verify that we have no information on a variable if it
is undetermined and all the tests to which it belongs are strippable. We call
such an undetermined variable isolated, as is the case for variable $l$ in Fig. 1. The
above terminology is motivated by the fact that all the information on the
undetermined variables is encoded in a reduced graph (see right part of Fig.
1) constructed via the following stripping procedure: erase all variable nodes
which correspond to sure variables and all test nodes which are strippable
(note that isolated variables are those that are not connected to any test
in the reduced graph). Therefore the inference problem corresponding to
the optimization of (4) can be rephrased as a Hitting Set problem on the
corresponding reduced graph [17].
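
The stripping procedure described above can be sketched in a few lines of Python. This is a minimal illustration assuming noiseless, mutually consistent test outcomes; the function name `strip_graph` and the small instance are hypothetical and do not reproduce Fig. 1.

```python
def strip_graph(pools, outcomes, n_items):
    """Classify variables and tests after one stage of noiseless OR tests."""
    # Sure zeros: members of at least one negative test.
    sure_zero = set()
    for pool, t in zip(pools, outcomes):
        if t == 0:
            sure_zero.update(pool)

    # Sure ones: the only member of some positive test that is not a sure zero.
    sure_one = set()
    for pool, t in zip(pools, outcomes):
        if t == 1:
            candidates = set(pool) - sure_zero
            if len(candidates) == 1:
                sure_one.update(candidates)

    # Strippable tests: negative tests, or tests containing a sure one.
    strippable = {
        a for a, (pool, t) in enumerate(zip(pools, outcomes))
        if t == 0 or sure_one.intersection(pool)
    }

    undetermined = set(range(n_items)) - sure_zero - sure_one
    # Isolated variables: undetermined, with every one of their tests strippable.
    isolated = {
        i for i in undetermined
        if all(a in strippable for a, pool in enumerate(pools) if i in pool)
    }
    return sure_zero, sure_one, undetermined, strippable, isolated


# Hypothetical instance with true configuration x = [0, 0, 1, 1, 0, 0].
pools = [[0, 1], [0, 2], [2, 5], [3, 4]]
outcomes = [0, 1, 1, 1]
result = strip_graph(pools, outcomes, n_items=6)
```

In this instance the reduced graph keeps only the last test and variables 3 and 4, while variable 5 is undetermined but isolated.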

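Given a variable $i$ and a choice of the pools, the probability $p_0^i$ (respectively
$p_1^i$) that $x_i$ is a sure zero (resp. a sure one) can be found as follows.
Let us denote by $\mathcal{N}_a$ ($\mathcal{N}_i$) the set of variable nodes (resp. the set of test nodes)
connected to test $a$ (resp. variable $i$). We introduce the indicator $G_i(x)$ that
$x_i$ is a sure zero as well as the indicator $V_i(x)$ that $x_i$ is a sure one:

$$ G_i(x) = (1-x_i)\Big[1 - \prod_{a\in\mathcal{N}_i} W_{i,a}(x)\Big] \qquad (5) $$

$$ V_i(x) = x_i\Big\{1 - \prod_{a\in\mathcal{N}_i}\Big[1 - \prod_{j\in\mathcal{N}_a\setminus i} G_j(x)\Big]\Big\} \qquad (6) $$

which are expressed in terms of

$$ W_{i,a}(x) = 1 - \prod_{j\in\mathcal{N}_a\setminus i}(1-x_j). \qquad (7) $$

Then $p_0^i$ and $p_1^i$ are given by:

$$ p_0^i := \sum_x B_p(x)\,G_i(x) \qquad (8) $$

$$ p_1^i := \sum_x B_p(x)\,V_i(x) \qquad (9) $$

where the sum is over all $x \in \{0, 1\}^N$.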

It is clear that Eq. (8) for $p_0^i$ involves only the variables at graph
distance two from $i$. Thus, if $i$ does not belong to a loop of length four
in the factor graph, the $W_{i,a}$ are independent variables and the mean over the
variable values in (8) can be easily carried out yielding

$$ p_0^i = (1-p)\Big[1 - \prod_{a\in\mathcal{N}_i}\big(1 - (1-p)^{k_a-1}\big)\Big] \qquad (10) $$

where $k_a = |\mathcal{N}_a|$ is the number of variables which belong to test $a$. Then,
for any given choice $\Lambda$, $P$ of the degree profiles of the random factor graph,
if the probability that two tests have degree $k$ and $k'$ factorizes, we can easily
perform the mean over the uniform distribution for the factor graphs. This
leads to a value which does not depend anymore on the index $i$ and can be
rewritten as $p_0 = (1-p)S_0$ with

$$ S_0 := \sum_l \Lambda_l\Big\{1 - \Big[1 - \sum_k \rho_k(1-p)^{k-1}\Big]^l\Big\} = 1 - \Lambda\big[1-\rho[1-p]\big]. \qquad (11) $$

Formula (9) for $p_1^i$ involves only variables at distance at most 4 from
$i$. If the ball centered in $i$ of radius 4 does not contain any loop, we can
easily perform the mean over the variables in (9) and get $p_1 = pS_1$ with

$$ S_1 := 1 - \Lambda\Big[1 - \rho\big[(1-p)\big(1 - \lambda[1-\rho[1-p]]\big)\big]\Big]. \qquad (12) $$



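The probability that $t_a$ is strippable ($R^a$), $x_i$ is an isolated zero ($I_0^i$) and
$x_i$ is an isolated one ($I_1^i$) are instead given by

$$ R^a = \sum_x B_p(x)\Big[\prod_{j\in\mathcal{N}_a}(1-x_j) + 1 - \prod_{j\in\mathcal{N}_a}\big(1-V_j(x)\big)\Big] \qquad (13) $$

$$ I_0^i = \sum_x B_p(x)\,(1-x_i)\prod_{a\in\mathcal{N}_i}\Big[1 - \prod_{j\in\mathcal{N}_a\setminus i}\big(1-V_j(x)\big)\Big] \qquad (14) $$

$$ I_1^i = \sum_x B_p(x)\,x_i\prod_{a\in\mathcal{N}_i}\Big[1 - \prod_{j\in\mathcal{N}_a\setminus i}\big(1-V_j(x)\big)\Big] \qquad (15) $$

In this case, if there is no loop in the ball of radius 6 centered on $i$, we
can easily perform the mean over the variables and over the random graph
distribution, which yields $I_0 = (1-p)I$ and $I_1 = pI$ with

$$ I = \Lambda\big[1 - \rho[1 - p\tilde{S}_1]\big], \qquad (16) $$

with

$$ \tilde{S}_1 := 1 - \lambda\Big[1 - \rho\big[(1-p)\big(1 - \lambda[1-\rho[1-p]]\big)\big]\Big]. \qquad (17) $$

4 One-stage algorithms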



In this section we analyze one-stage algorithms when the number of items,
$N$, goes to infinity and the defect probability, $p$, goes to zero as $p = N^{-\beta}$
with $\beta > 0$. When constructing the pools we use random graph ensembles
of two types: either regular-regular (R-R) graphs (fixed connectivity both
for test and variable nodes) or regular-Poisson (R-P) graphs (fixed connectivity
for variables, Poisson distribution for the test degree). As for the
inference procedure we will consider two types of algorithms: Easy Algorithm
(EA) and Belief Propagation (BP). We will show that both undergo
a non-detection/detection phase transition when one varies the number of
tests, $M$: we identify a threshold $\overline{M}$ such that for $M < \overline{M}$ the overall
detection error goes (as $N \to \infty$) to one while for $M > \overline{M}$ it goes to
zero. When $\beta < 1/3$ we can establish analytically the value of $\overline{M}$ which
turns out to be equal for the two algorithms: EA and BP have the same
performance in the large $N$ limit. We will explain why this transition is
robust and we will optimize the pool design (i.e. the choice of the parameters
of the regular-regular and regular-Poisson graphs) to obtain the smallest
possible $\overline{M}$. The resulting algorithms have a threshold value which satisfies
$\lim_{N\to\infty}\overline{M}/(Np|\log p|) = (1-\beta)\beta^{-1}(\log 2)^{-2}$. This is the same scaling in
$N$ and $p$ as for the optimal number of tests in an exact two-stage algorithm,
albeit with a different prefactor.

4.1 Pool design 

Given a random graph ensemble, we denote by $M$ the number of test nodes,
by $K$ the mean degree of tests (which also coincides with the degree of each
test in the R-R case) and by $L$ the degree of each variable and we let

$$ M = cNp\log N, \qquad K = \alpha/p, \qquad L = MK/N = c\alpha\log N. \qquad (18) $$

The degree profile polynomials are:

$$ \Lambda^{R-R}[x] = x^{L}, \quad \lambda^{R-R}[x] = x^{L-1}, \quad P^{R-R}[x] = x^{K}, \quad \rho^{R-R}[x] = x^{K-1}, $$

$$ \Lambda^{R-P}[x] = x^{L}, \quad \lambda^{R-P}[x] = x^{L-1}, \quad P^{R-P}[x] = \rho^{R-P}[x] = e^{K(x-1)}. $$
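
To make the regular-regular ensemble concrete, the sketch below draws one pool design with fixed degrees by a configuration-model (socket shuffling) construction. This is an assumption made for illustration: it is a common stand-in for sampling uniformly among graphs with given degrees, but it can place the same item twice in one pool, which a uniform sample over simple graphs would not. The function name is hypothetical.

```python
import random


def sample_regular_regular(n_items, n_tests, var_degree, test_degree, seed=None):
    """Configuration-model sample with fixed degrees L (variables) and K (tests).

    Returns a list of pools; a pool may contain the same item twice, which the
    uniform ensemble over simple graphs would exclude.
    """
    if n_items * var_degree != n_tests * test_degree:
        raise ValueError("edge counts must match: N * L == M * K")
    rng = random.Random(seed)
    # One "socket" per edge end on the variable side, shuffled and cut into pools.
    sockets = [i for i in range(n_items) for _ in range(var_degree)]
    rng.shuffle(sockets)
    return [sockets[a * test_degree:(a + 1) * test_degree] for a in range(n_tests)]


pools = sample_regular_regular(n_items=12, n_tests=6, var_degree=2, test_degree=4, seed=0)
```

The constraint $NL = MK$ enforced in the first lines is the edge-counting identity used to define $L$ in (18).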

Then, if the hypotheses on the absence of short loops which lead to (11),
(12) and (16) are valid, the probabilities $S_0$, $S_1$ and $I$ are given in the R-R
case by:

$$ S_0 = 1 - \big(1 - (1-p)^{K-1}\big)^{L}, \qquad (19) $$

$$ S_1 = 1 - \Big\{1 - (1-p)^{K-1}\big[1 - \big(1-(1-p)^{K-1}\big)^{L-1}\big]^{K-1}\Big\}^{L}, \qquad (20) $$

$$ \tilde{S}_1 = 1 - \Big\{1 - (1-p)^{K-1}\big[1 - \big(1-(1-p)^{K-1}\big)^{L-1}\big]^{K-1}\Big\}^{L-1}, \qquad (21) $$

$$ I = \big(1 - (1-p\tilde{S}_1)^{K-1}\big)^{L}. \qquad (22) $$

In the R-P case they are given by:

$$ S_0 = 1 - \big(1 - \exp(-Kp)\big)^{L}, \qquad (23) $$

$$ S_1 = 1 - \Big(1 - \exp\big(-Kp - K(1-p)(1-e^{-Kp})^{L-1}\big)\Big)^{L}, \qquad (24) $$

$$ \tilde{S}_1 = 1 - \Big(1 - \exp\big(-Kp - K(1-p)(1-e^{-Kp})^{L-1}\big)\Big)^{L-1}, \qquad (25) $$

$$ I = \big(1 - \exp(-Kp\tilde{S}_1)\big)^{L}. \qquad (26) $$
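
The closed forms for the R-R and R-P ensembles are easy to evaluate numerically, which gives a quick check that the two ensembles agree when $p$ is small. The sketch below is illustrative only: the function names and the returned tuple order are assumptions, and `s1_tilde` denotes the edge-perspective analogue of $S_1$ that enters the isolation probability.

```python
import math


def rr_probabilities(p, K, L):
    """Regular-regular case: returns (S0, S1, S1_tilde, I)."""
    q = (1.0 - p) ** (K - 1)                      # all other members of a test are 0
    s0 = 1.0 - (1.0 - q) ** L
    inner = q * (1.0 - (1.0 - q) ** (L - 1)) ** (K - 1)
    s1 = 1.0 - (1.0 - inner) ** L
    s1_tilde = 1.0 - (1.0 - inner) ** (L - 1)
    iso = (1.0 - (1.0 - p * s1_tilde) ** (K - 1)) ** L
    return s0, s1, s1_tilde, iso


def rp_probabilities(p, K, L):
    """Regular-Poisson case: returns (S0, S1, S1_tilde, I)."""
    q = math.exp(-K * p)
    s0 = 1.0 - (1.0 - q) ** L
    inner = math.exp(-K * p - K * (1.0 - p) * (1.0 - q) ** (L - 1))
    s1 = 1.0 - (1.0 - inner) ** L
    s1_tilde = 1.0 - (1.0 - inner) ** (L - 1)
    iso = (1.0 - math.exp(-K * p * s1_tilde)) ** L
    return s0, s1, s1_tilde, iso


# Small defect probability with K * p close to log 2 (illustrative values).
rr = rr_probabilities(p=1e-4, K=6931, L=20)
rp = rp_probabilities(p=1e-4, K=6931, L=20)
```

For these illustrative parameters the two ensembles give nearly identical values of all four quantities.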

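It is easy to verify that in leading order when $N \to \infty$ and $p \to 0$ the
above quantities for the regular-regular and regular-Poisson cases coincide.
In particular if we set $p = N^{-\beta}$ they are given by

$$ S_0 \simeq 1 - N^{d}, \qquad (27) $$

$$ S_1 \simeq \begin{cases} (c\alpha\log N)\,e^{-\alpha(1+N^{\beta+d}/b)} & \text{if } \beta + d > 0 \\ 1 - N^{d} & \text{if } \beta + d < 0 \\ 1 - N^{-c\alpha|\log(1-\exp(-2\alpha))|} & \text{if } \beta + d = 0 \end{cases} \qquad (28) $$

and

$$ I \simeq \begin{cases} (c\alpha^2\log N)^{c\alpha\log N}\,e^{-\alpha^2 c\log N\,(1+N^{\beta+d}/b)} & \text{if } \beta + d > 0 \\ N^{d} & \text{if } \beta + d < 0 \\ N^{d} & \text{if } \beta + d = 0 \end{cases} \qquad (29) $$

where we set $b = b(\alpha) = 1 - \exp(-\alpha)$ and $d = d(\alpha,c) = -c\alpha|\log b|$, for
$N \to \infty$.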

Let us discuss in what range of $\beta$ one expects the above asymptotic
behaviors to be valid. As explained in Sec. 3, the only hypothesis in
their derivation consists in neglecting the presence of some short loops in a
proper neighborhood of the chosen variable. In particular the equation for
$S_0$ is valid if we can neglect the presence of loops of length four through a
given variable. Consider for definiteness the R-R case. The probability of
having at least one loop of length four through $i$, $P(L_4)$, verifies



$$ P(L_4) \le L^2 N\,\binom{M-2}{L-2}\Big/\binom{M}{L} $$



which goes to zero for $\beta < 1/2$. Thus we are guaranteed that (19) is correct
in this regime. By the same type of reasoning, we can show that the formulas
for $S_1$ and $I$ are valid respectively for $\beta < 1/4$ and $\beta < 1/6$. However
through the following heuristic argument, one can expect that the formula
for $S_1$ (resp. $I$) is correct in the larger regimes $\beta < 1/2$ (resp. $\beta < 1/3$).
Indeed, when we evaluate $S_1$ we need to determine whether variables at
distance 2 from a variable $i$ are sure zeros. We expect the probability of
this joint event to be well approximated by the product of the single event
probabilities if the number of tests that a variable at distance 2 from $i$
shares with the others is $\ll L$ and if the number of variables that a test
at distance 3 from $i$ shares with the others is $\ll K$. Both conditions are
satisfied if $\beta < 1/2$ (the probability that a test at distance 3 belongs to more
than one variable at distance 2 goes as $(1 - K/N)^{KL}$ and the probability
that a variable at distance 4 belongs to more than one test at distance 3
goes as $(1 - K/N)^{KL^2}$). For $I$ the argument is similar but, since we have a
further shell in tests and variables to analyze in order to determine whether
a variable is isolated or not, we get an extra factor $KL$ in the exponents
which leads to the validity of the approximations only for $\beta < 1/3$.

4.2 Easy algorithm (EA) 

A straightforward inference procedure is the one that fixes the sure variables
to their value and does not analyze the remaining information carried by
the tests, thus setting to zero all other variables (since $p < 1/2$). We call
this procedure Easy Algorithm (EA). By definition the probability that a
variable is set to a wrong value, $E_{bit}$, is given by $E_{bit} = p - pS_1$. In the
hypothesis of independent bit errors, i.e. if we suppose that the probability
$E_{tot}$ of making at least one mistake satisfies $E_{tot} = 1 - (1 - E_{bit})^N$, and if
$\beta < 1/2$ (see the discussion at the end of the previous section), we can apply
(28) which yields



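$$ E_{tot} \simeq \begin{cases} 1 - \exp\big(-N^{1-\beta}\big) & \text{if } \beta + d > 0 \\ 1 - \exp\big(-N^{1-\beta+d}\big) & \text{if } \beta + d < 0 \\ 1 - \exp\big(-N^{1-\beta+c\alpha\log(1-\exp(-2\alpha))}\big) & \text{if } \beta + d = 0 \end{cases} \qquad (30) $$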



both for the R-R and R-P graphs. Therefore EA displays a phase transition
in the large $N$ limit, when one varies the parameter $c = L/(\alpha\log N)$, from a
region $c < \overline{c}(\alpha)$ in which the probability of at least one error, $E_{tot}$, goes
to one, to a region $c > \overline{c}(\alpha)$ where it goes to zero. The threshold of this
regime is given by

$$ \overline{c}(\alpha) = \frac{1-\beta}{\alpha\,|\log(1-\exp(-\alpha))|}. \qquad (31) $$

The most efficient pools, within the R-R and R-P families, are obtained by
minimizing $\overline{c}(\alpha)$ with respect to $\alpha = Kp$. The value of the optimal threshold
$\tilde{c} = \min_\alpha \overline{c}(\alpha)$ and the parameter $\tilde{\alpha}$ at which the optimal value is attained,
namely $\overline{c}(\tilde{\alpha}) = \tilde{c}$, are

$$ \tilde{c} = \frac{1-\beta}{(\log 2)^2}, \qquad \tilde{\alpha} = \log 2. $$

This, together with (18), gives a threshold

$$ \overline{M} = Np|\log p|\,(1-\beta)\beta^{-1}(\log 2)^{-2} \qquad (32) $$

for the number of tests. Note that the threshold in the case $\beta = 0$, i.e. if
we send $p \to 0$ after $N \to \infty$, is infinite. This corresponds to the fact that
for any choice $M = cNp|\log p|$ and $K = \alpha/p$ the bit error $p(1 - S_1)$ stays
finite when $N \to \infty$, since $K$ and $L$ depend only on $p$.

Figure 2: a) Error probability as a function of $c$ using EA (red circles) and
BP (black squares) for a regular-regular graph. The graph parameters are
chosen as in (18) with $p = N^{-\beta}$, $\beta = 1/4$, $\alpha = \tilde{\alpha} = \log 2$ and $N = 43321$.
The continuous line corresponds to the theoretical prediction of Eq. (30).
b) Error probability as a function of $c$ for EA. We set again $\beta = 1/4$ and
$\alpha = \log 2$, while we choose $N = 1109$ (green diamonds), $N = 10401$ (blue
squares), and $N = 63426$ (red circles). The vertical dashed line corresponds
to the threshold $\overline{c}$, given by Eq. (31).
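
The optimization of the EA threshold over $\alpha$ can be reproduced numerically. The following sketch minimizes $\overline{c}(\alpha) = (1-\beta)/(\alpha|\log(1-e^{-\alpha})|)$ by ternary search, which is adequate because this function has a single minimum; the function names, the search interval and the iteration count are illustrative assumptions.

```python
import math


def c_bar(alpha, beta):
    """Threshold c_bar(alpha) = (1 - beta) / (alpha * |log(1 - exp(-alpha))|)."""
    return (1.0 - beta) / (alpha * abs(math.log(1.0 - math.exp(-alpha))))


def optimal_alpha(beta, lo=1e-3, hi=10.0, iterations=200):
    """Ternary search for the minimiser of c_bar, which is unimodal in alpha."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if c_bar(m1, beta) < c_bar(m2, beta):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def threshold_tests(n_items, beta, alpha):
    """M_bar = c_bar * N * p * log N with p = N^(-beta)."""
    p = n_items ** (-beta)
    return c_bar(alpha, beta) * n_items * p * math.log(n_items)


beta = 0.25
alpha_opt = optimal_alpha(beta)   # numerically close to log 2
c_opt = c_bar(alpha_opt, beta)    # numerically close to (1 - beta) / (log 2)^2
```

Since $\log N = |\log p|/\beta$, `threshold_tests` evaluated at $\alpha = \log 2$ agrees with the expression $(1-\beta)\beta^{-1}(\log 2)^{-2}Np|\log p|$.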

In order to verify the above results and the approximations on which
they are based we have performed numerical simulations in the case of the
R-R graph with $\beta = 1/4$, $\alpha = \tilde{\alpha}$ and different values of $c$. The results we
obtain confirm that in this regime bit errors can be regarded as independent
and formulas (19)-(22) are valid. The values of $E_{tot}$ as a function of $c$ are
depicted in Fig. 2b for different values of $N$. The value of the threshold
connectivity and the form of the finite size corrections for the total error
(continuous curves) are in excellent agreement with the above predictions
(30) and (31). Furthermore we have verified that when $\beta > 1/2$ both the
independent bit error approximation and the approximation leading to Eq.
(20) fail as expected. This can be seen for example in Fig. 3a where we
report the results for the case $\beta = 2/3$. Indeed the numerical results (black
dots) differ from the continuous line which corresponds to Eq. (30), thus
confirming that in this case both the shape of finite size corrections and the
position of the threshold cannot be derived from (30).

4.3 Belief Propagation (BP)

The algorithm considered in the previous section is very simple since it does not
exploit the information contained in the reduced graph (see Sec. 3). We
will now instead define a different algorithm in order to extract as much
information as we can from the first stage. As already explained in Sec. 3,
this requires in principle the optimization of Eq. (4). In order to perform
this task we will use the Belief Propagation (BP) algorithm to estimate for each
variable $i$ the value of the marginal probability $P(x_i)$. Then we will set to
one (to zero) variables for which $P(x_i) > 1/2$ (respectively $P(x_i) < 1/2$).
Let us derive the BP equations for the marginal probabilities. We denote by
$\mathcal{N}_i$ ($\mathcal{N}_a$) the set of function (variable) nodes connected to the variable node
$i$ (respectively to the function node $a$), by $P(x_i)^{(a)}$ the probability of value
$x_i$ for the $i$-th variable in absence of test $a$, and by $P(x_1, x_2, \ldots, x_N)^{(a)}$ the
joint cavity distribution in the absence of $a$ (so that $P(x_i)^{(a)} = P(x_i)^{i\to a}$).
We can then write

$$ P(x_i)^{(b)} \propto p^{x_i}(1-p)^{1-x_i}\prod_{a\in\mathcal{N}_i\setminus b}\Big[\sum_{\underline{x}_{a\setminus i}} P(\underline{x}_{a\setminus i})^{(a)}\,\mathbb{1}\big(T_a(x) = t_a\big)\Big] $$

where by $\underline{x}_{a\setminus i}$ we denote the vector $\{x_j \,|\, j \in \mathcal{N}_a\setminus i\}$. Furthermore we make
the usual assumption that the joint cavity distributions $P(\underline{x}_{a\setminus i})^{(a)}$ factorize

$$ P(\underline{x}_{a\setminus i})^{(a)} = \prod_{j\in\mathcal{N}_a\setminus i} P(x_j)^{(a)} = \prod_{j\in\mathcal{N}_a\setminus i} P(x_j)^{j\to a} $$

which leads to closed equations for the set of single variable cavity probabilities.
In order to simplify these equations we define a normalized message
$P(x_i)^{a\to i}$ from function node $a$ to variable node $i$ as

$$ P(x_i)^{a\to i} := C\sum_{\underline{x}_{a\setminus i}}\;\prod_{j\in\mathcal{N}_a\setminus i} P(x_j)^{j\to a}\;\mathbb{1}\big(T_a(x) = t_a\big) $$

and therefore

$$ P(x_i)^{i\to a} = B\,p^{x_i}(1-p)^{1-x_i}\prod_{b\in\mathcal{N}_i\setminus a} P(x_i)^{b\to i} $$

and

$$ P(x_i) = B\,p^{x_i}(1-p)^{1-x_i}\prod_{b\in\mathcal{N}_i} P(x_i)^{b\to i}, $$

with $C$ and $B$ normalization constants. Using the fact that $x_i$ takes values in
$\{0, 1\}$ and that both $P^{a\to i}$ and $P^{i\to a}$ are normalized we introduce cavity
fields $h_{i\to a}$ and cavity biases $u_{a\to i}$ defined as follows

$$ P(x_i)^{a\to i} = (1 - u_{a\to i})\,\delta_{x_i,0} + u_{a\to i}\,\delta_{x_i,1} $$

$$ P(x_i)^{i\to a} = (1 - h_{i\to a})\,\delta_{x_i,0} + h_{i\to a}\,\delta_{x_i,1}. $$

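The BP equations for the cavity biases and fields are:

$$ u_{a\to i} = \begin{cases} 0 & \text{if } t_a = 0 \\ \dfrac{1}{2 - \prod_{j\in\mathcal{N}_a\setminus i}(1 - h_{j\to a})} & \text{if } t_a = 1 \end{cases} $$

and

$$ h_{i\to a} = \frac{p\prod_{b\in\mathcal{N}_i\setminus a} u_{b\to i}}{p\prod_{b\in\mathcal{N}_i\setminus a} u_{b\to i} + (1-p)\prod_{b\in\mathcal{N}_i\setminus a}(1 - u_{b\to i})}. $$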
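
The update rules for cavity biases and cavity fields can be iterated directly. The sketch below is a minimal illustration rather than the implementation used for the simulations reported here: it assumes noiseless and mutually consistent test outcomes, pools without repeated items, initialization of all cavity fields at the prior $p$, a parallel update schedule, and a fixed number of sweeps in place of a convergence test. All function names are hypothetical. It returns the full local field of each variable, obtained from the same ratio as the cavity field but with the product running over all tests of the variable.

```python
def _field(p, biases):
    """Combine the prior p with incoming cavity biases into a field in [0, 1]."""
    num, den = p, 1.0 - p
    for bias in biases:
        num *= bias
        den *= 1.0 - bias
    return num / (num + den)


def belief_propagation(pools, outcomes, n_items, p, n_iterations=50):
    """Iterate the BP updates and return the full local fields H_i."""
    tests_of = [[] for _ in range(n_items)]
    for a, pool in enumerate(pools):
        for i in pool:
            tests_of[i].append(a)

    # Cavity fields h[(i, a)], initialised at the prior (an arbitrary choice).
    h = {(i, a): p for a, pool in enumerate(pools) for i in pool}
    u = {}
    for _ in range(n_iterations):
        # Cavity biases u[(a, i)] sent from test a to variable i.
        for a, pool in enumerate(pools):
            for i in pool:
                if outcomes[a] == 0:
                    u[(a, i)] = 0.0
                else:
                    prod = 1.0
                    for j in pool:
                        if j != i:
                            prod *= 1.0 - h[(j, a)]
                    u[(a, i)] = 1.0 / (2.0 - prod)
        # Cavity fields h[(i, a)] sent from variable i to test a.
        for i in range(n_items):
            for a in tests_of[i]:
                h[(i, a)] = _field(p, [u[(b, i)] for b in tests_of[i] if b != a])

    # Full local fields: same formula, with all tests of the variable included.
    return [_field(p, [u[(b, i)] for b in tests_of[i]]) for i in range(n_items)]


# Hypothetical instance: true configuration [0, 0, 1, 0] on a ring of four tests.
pools = [[0, 1], [1, 2], [2, 3], [0, 3]]
outcomes = [0, 1, 1, 0]
fields = belief_propagation(pools, outcomes, n_items=4, p=0.1)
decoded = [int(value > 0.5) for value in fields]
```

In this small instance every variable is sure, so the fields saturate at 0 or 1 and thresholding them at 1/2 recovers the true configuration.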



Our detection procedure corresponds to initializing the cavity fields and biases
to some values and iterating the BP equations above until they converge. Then,
the marginal probability distribution $P(x_i)$ can be rewritten as

$$ P(x_i) = (1 - H_i)\,\delta_{x_i,0} + H_i\,\delta_{x_i,1} $$

with the full local field $H_i$ satisfying

$$ H_i = \frac{p\prod_{b\in\mathcal{N}_i} u_{b\to i}}{p\prod_{b\in\mathcal{N}_i} u_{b\to i} + (1-p)\prod_{b\in\mathcal{N}_i}(1 - u_{b\to i})} $$

and the inference procedure is completed by setting $x_i$ to one (to zero)
if $H_i > 1/2$ ($H_i < 1/2$ respectively). Note that on the sure variables the BP
algorithm leads to the correct detection. Furthermore one should expect that
its performance is better than that of EA: since we also analyze the information
which comes from tests which are non strippable, it is possible that some
of the undetermined variables, which are all set to zero in EA, are here correctly
detected.

In order to test the performance of the BP algorithm we ran the procedure
on the regular-regular graph for $\beta = 1/4$ and $\alpha = \log 2$ as we did for EA.
The total error probability as a function of $c$ is reported in Fig. 2a (black
squares). As for EA, a non-detection/detection phase transition occurs at
$c = \tilde{c} = (1-\beta)/(\log 2)^2$. Thus, even if EA is a much simpler algorithm, the
performance of the two coincides in the large $N$ limit, suggesting that for the choice
$p = N^{-\beta}$ with $\beta = 1/4$ the reduced graph does not carry any additional
information. In Fig. 3a we plot instead the total error of EA and BP
when $\beta = 2/3$, at a fixed system size $N$ (a power of 2). The data indicate that the BP algorithm performs
much better than EA in this case: the reduced graph carries information
which is used by BP to optimize the procedure. We have also verified
that the difference between BP and EA performance does not diminish as
the size of the graph is increased. In Fig. 3b we plot instead the results
for BP again in the case $\beta = 2/3$ but for different values of $N$. The data
become sharper as $N$ is increased. Similarly to the $\beta = 1/4$ case, this seems

to indicate the presence of a sharp phase transition in the thermodynamic
limit.

Figure 3: a) Error probability as a function of $c$ for a regular-regular graph
using EA (black squares) and BP (red circles). The graph parameters are
chosen as in (18), with $p = N^{-\beta}$, $\beta = 2/3$, $\alpha = 1$ and $N$ a power of 2. The
continuous line corresponds to formula (30). As explained in the text, the
discrepancy between the latter and the numerical results confirms that in
this regime the approximations leading to (30) are not verified. b) Error
probability as a function of $c$ using BP. We set again $\beta = 2/3$, $\alpha = 1$ and we
choose three system sizes $N$, all powers of 2 (red circles, blue squares, and green diamonds).

Let us start by evaluating the non-detection/detection threshold from
BP equations and then explain why we expect it to coincide with the one
for EA at least when $\beta$ is in $(0, 1/3)$.

If we denote by $\mathcal{P}^0(H)$ ($\mathcal{P}^1(H)$) the mean over the random graph distribution
of the probability for the full local field on $i$ conditioned to the fact
that $x_i = 0$ ($x_i = 1$), the probability of setting to a wrong value the $i$-th
variable is here $E_{bit} = E^0_{bit} + E^1_{bit}$ with

$$ E^0_{bit} = (1-p)\int_{1/2}^{1}\mathcal{P}^0(H)\,dH \qquad (33) $$

$$ E^1_{bit} = p\int_{0}^{1/2}\mathcal{P}^1(H)\,dH. \qquad (34) $$

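From the BP equations it is easy to obtain the following 'replica symmetric'
cavity equations satisfied by $\mathcal{P}^0(H)$ and $\mathcal{P}^1(H)$ [18]:

$$ \mathcal{P}^0(H) = \sum_l \Lambda_l\int\prod_{b=1}^{l} dQ^0(u_b)\;\delta\Big(H - \frac{p\prod_b u_b}{p\prod_b u_b + (1-p)\prod_b(1-u_b)}\Big) \qquad (35) $$

$$ \mathcal{P}^1(H) = \sum_l \Lambda_l\int\prod_{b=1}^{l} dQ^1(u_b)\;\delta\Big(H - \frac{p\prod_b u_b}{p\prod_b u_b + (1-p)\prod_b(1-u_b)}\Big) \qquad (36) $$

where

$$ P^0(h) = \sum_l \lambda_l\int\prod_{b=1}^{l-1} dQ^0(u_b)\;\delta\Big(h - \frac{p\prod_b u_b}{p\prod_b u_b + (1-p)\prod_b(1-u_b)}\Big) \qquad (37) $$

$$ P^1(h) = \sum_l \lambda_l\int\prod_{b=1}^{l-1} dQ^1(u_b)\;\delta\Big(h - \frac{p\prod_b u_b}{p\prod_b u_b + (1-p)\prod_b(1-u_b)}\Big) \qquad (38) $$

$$ Q^0(u) = \sum_k \rho_k\Big\{(1-p)^{k-1}\delta(u) + \int\Big[\prod_{j=1}^{k-1}\big(p\,dP^1(h_j) + (1-p)\,dP^0(h_j)\big) - \prod_{j=1}^{k-1}(1-p)\,dP^0(h_j)\Big]\,\delta\Big(u - \frac{1}{2 - \prod_j(1-h_j)}\Big)\Big\} \qquad (39) $$

$$ Q^1(u) = \sum_k \rho_k\int\prod_{j=1}^{k-1}\big(p\,dP^1(h_j) + (1-p)\,dP^0(h_j)\big)\;\delta\Big(u - \frac{1}{2 - \prod_j(1-h_j)}\Big). \qquad (40) $$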




It is now easy to verify that $\mathcal{P}^0(0) = S_0$ and $\mathcal{P}^1(1) = S_1$, where $S_0$ and
$S_1$ are the probabilities that a variable is a sure zero and a sure one respectively, and
are given by Eqs. (19) and (20). Furthermore the following relation holds

$$ \mathcal{P}^0(p) = \mathcal{P}^1(p) \ge \Lambda[Q^0(1/2)] = \Lambda[Q^1(1/2)] = \Lambda\big[1 - \rho(1 - p\tilde{S}_1)\big] = I $$

where $I$ is the probability that a variable is isolated, given in Eq. (22).

By using the above observations together with the definitions (33) and
(34) for the bit error probabilities one obtains the following inequalities

$$ E^0_{bit} \le (1-p)\big(1 - \mathcal{P}^0(0) - \mathcal{P}^0(p)\big) \le (1-p)(1 - S_0 - I) \qquad (41) $$

$$ pI \le p\,\mathcal{P}^1(p) \le E^1_{bit} \le p\big(1 - \mathcal{P}^1(1)\big) = p(1 - S_1). \qquad (42) $$

We will now show how it is possible to locate the non-detection/detection
transition from these inequalities without the need to evaluate the bit error
probabilities.

The leading orders of the quantities $S_0$, $S_1$ and $I$ have been evaluated
in Sec. 4.1. Furthermore, for $\beta + d < 0$ the higher order corrections
give $S_0 = 1 - N^{d} - fN^{-\beta+d}\log N$ and $I = N^{d} - fN^{-\beta+d}\log N$ where $f =
\exp(-\alpha)(\alpha/2 + 1)/(1 - \exp(-\alpha))$. Thus

$$ N^{-\beta+d} \le E_{bit} \le 2fN^{-\beta+d}\log N. $$

Therefore, in the assumption of independent bit errors, we get

$$ 1 - \exp\big(-N^{1-\beta+d}\big) \le E_{tot} = 1 - \big(1 - E^0_{bit} - E^1_{bit}\big)^N \le 1 - \exp\big(-N^{1-\beta+d}\log N\big) $$

for $\beta + d < 0$, namely $c\alpha|\log(1 - \exp(-\alpha))| > \beta$. Since $\beta < 1/2$ we have
$1 - \beta > \beta$ and the above bounds on the total error imply the occurrence of
a phase transition at the same value $\overline{c}(\alpha)$ found with the EA algorithm (see
(31)). Thus the performance of EA and BP coincide if the approximations
leading to Eqs. (19), (20) and (22) are correct. By the discussion at the
end of Sec. 4.1 we know that these approximations are under full control
for $\beta < 1/6$ and we expect them to hold also up to $\beta < 1/3$. We conclude
that in this regime the value of the threshold for the BP transition equals the
one for EA (31), as is indeed confirmed by the numerical results that we
already discussed for the case $\beta = 1/4$ (see Fig. 2). We stress that there
is no reason for that to be true in the regime where the approximations of
neglecting proper loops which lead to (19), (20) and (22) do not hold. For
example, as is shown in Fig. 3a and b, in the case $\beta = 2/3$ even if a sharp
non-detection/detection phase transition seems to occur when $N \to \infty$, the
error probability is certainly not in agreement with (30) which for the chosen
parameters would yield a threshold at $c \simeq 1.453$.

Note that in the discussion above we have upper bounded the bit error
with the error over all variables that are neither sure nor isolated and lower
bounded it with the error over isolated variables. It is thus immediate
to see that the position of the phase transition remains unchanged for all
algorithms which set to zero all the isolated variables (which is the best
guess since we have no information and $p < 1/2$) and set to the correct
value the sure variables (EA is indeed the simplest algorithm which belongs
to this class). This is due to the fact that the mean number of tests in the
reduced graph goes to zero in the detection regime $-d > 1 - \beta > 2/3$, as
can be checked using formula (13) and neglecting loops.

Finally, we would like to stress that even if we have shown that EA and
BP inference procedures are optimal for R-R and R-P pool designs, at least
when $\beta < 1/3$, this does not imply that these pool designs are optimal over
all the possible designs of the factor graph. However, an indication that
they might be optimal comes from the results on two-stage exact algorithms
presented in section 5. As a further check we have evaluated the thresholds
for the Poisson-Poisson (P-P) and Poisson-regular (P-R) cases. Using the
same technique as above, we found in both cases a non-detection/detection
phase transition which occurs at the same threshold for EA and BP. If we
set K = a/p, M = ca log A^, L = ca\ogp the threshold value is 

c{a) = (43) 
a exp(— aj 

By optimizing (43) over the choice of $\alpha$ we get $\alpha = 1$ and $M = eNp|\log p|$,
which is larger than the optimal threshold for R-R and R-P.



5 Two-stage algorithms 

In this section we analyze two-stage exact algorithms when the number
of items, $N$, goes to infinity and the defect probability, $p$, goes to zero as
$p = N^{-\beta}$. This setting was first discussed by Berger and Levenshtein in [15],
where they proved that if $0 < \beta < 1$, the minimal (over all two-stage exact
procedures) mean number of tests, $T(N,p)$, satisfies the bounds

$$\frac{1}{\log 2} \le \lim_{N\to\infty} \frac{T(N,p)}{Np|\log p|} \le \frac{4}{\beta}.$$

In [16] two of the authors have derived the prefactor for the above scaling
when $0 \le \beta < 1/2$,

$$\lim_{N\to\infty} \frac{T(N,p)}{Np|\log p|} = \frac{1}{(\log 2)^2} \tag{44}$$

and constructed a choice of algorithms over which this optimal value is
attained. Note that our analysis includes the case $\beta = 0$, namely the
situation in which the limit $p \to 0$ is taken after $N \to \infty$. Note also that the
asymptotic result (44) is a factor $1/\log 2$ above the information theoretic bound
$T(N,p) \ge Np|\log p|/\log 2$. In section 5.1 we give a short account of the
derivation of (44) and we construct an optimal algorithm. In section 5.2 we
test the performance of algorithms corresponding to different choices of the
random pools of the first stage.
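
To make the gap between the two leading-order expressions concrete, the following minimal sketch evaluates the information theoretic bound and the asymptotic two-stage value (44) for given $N$ and $\beta$. The function names and the sample values of $N$ and $\beta$ are illustrative assumptions; only the two formulas come from the text, and finite-size corrections are ignored.

```python
import math


def information_bound(n_items: float, p: float) -> float:
    """Leading-order information theoretic bound N p |log p| / log 2."""
    return n_items * p * abs(math.log(p)) / math.log(2)


def two_stage_asymptotic(n_items: float, p: float) -> float:
    """Leading-order optimal two-stage mean number of tests, Eq. (44)."""
    return n_items * p * abs(math.log(p)) / math.log(2) ** 2


if __name__ == "__main__":
    n_items = 1e6          # hypothetical number of items
    beta = 0.25            # hypothetical exponent, p = N^(-beta)
    p = n_items ** (-beta)
    info = information_bound(n_items, p)
    two_stage = two_stage_asymptotic(n_items, p)
    # The ratio is 1/log 2 (about 1.4427) whatever N and p are.
    print(f"p = {p:.4g}")
    print(f"information bound     ~ {info:.0f} tests")
    print(f"two-stage asymptotics ~ {two_stage:.0f} tests")
    print(f"individual testing    = {n_items:.0f} tests")
    print(f"ratio = {two_stage / info:.4f}")
```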

5.1 Optimal number of tests for $p = N^{-\beta}$, $\beta \in [0, 1/2)$

An exact two-stage algorithm involves a first stage of tests after which all
sure variables are identified and set to their value. Then a second stage is
performed where all the remaining variables are individually tested. The mean
number of tests, $T(N,p)$, is therefore given by

$$T(N,p) = M + N - \sum_{i=1}^{N}\left(p^{i}_{s0} + p^{i}_{s1}\right) \tag{45}$$

where $M$ is the number of tests of the first stage and $p^{i}_{s0}$ and $p^{i}_{s1}$ are the
probabilities for variable $i$ to be a sure zero and a sure one. The latter in turn
are given by Eqs. (8) and (9), with the $\mathcal{N}_a$'s and $\mathcal{N}_i$'s being the neighborhoods of
tests and variables of the first stage.

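It is immediate to verify that in the limit $N \to \infty$ and $p \to 0$ the number
of individual checks over undetected ones is irrelevant (since $\sum_i p^{i}_{s1} \le Np$ is
negligible with respect to $Np|\log p|$), i.e.

$$\lim_{N\to\infty}\frac{T(N,p)}{Np|\log p|} = \lim_{N\to\infty}\frac{M + N - \sum_{i=1}^{N} p^{i}_{s0}}{Np|\log p|}. \tag{46}$$
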

Furthermore $p^{i}_{s0}$ is always upper bounded by the expression (10) obtained
by neglecting loops, as is proven in [16] by using the Fortuin-Kasteleyn-Ginibre
inequality [20] together with the observation that the events that at least
one variable equals one in two (or more) intersecting pools are positively
correlated. We define $f(\vec m)$ to be the fraction of sites such that among their
neighbors there are $m_1$ tests of degree 1, $m_2$ tests of degree 2, etc. By using
(46) and (10), the optimal number of tests over all two-stage procedures can
be lower bounded as

$$\frac{T(N,p)}{Np|\log p|} \ge \inf_{f(\vec m)} \sum_{\vec m} f(\vec m)\,\frac{\sum_{i=1}^{N} m_i/i + (1-p)\,P(\vec m)}{p|\log p|} \tag{47}$$

where the infimum is over all possible probability distributions $f$ over
$\vec m \in \{1,\dots,N\}^N$ with $\sum_{\vec m} f(\vec m) = 1$, and

$$P(\vec m) = \prod_{i=1}^{N}\left(1 - (1-p)^{i-1}\right)^{m_i}. \tag{48}$$



Minimization over $f(\vec m)$ can then be carried out and leads in the limit $p \to 0$
to

$$\lim_{N\to\infty}\frac{T(N,p)}{Np|\log p|} \ge \frac{1}{(\log 2)^2}. \tag{49}$$


Furthermore the above minimization procedure shows that this infimum is
attained for $f(\vec m) = \delta_{\vec m,\vec m^{*}}$ with $m^{*}_i = \delta_{i,\log 2/p}\,[|\log p|/\log 2]$. This implies
that the lower bound is saturated on the uniform distribution over
regular-regular graphs with $L = [|\log p|/\log 2]$ and $K = [\log 2/p]$, provided that we
can neglect loops in the evaluation of $p^{i}_{s0}$. This, as already explained in
section 4.1, is true as long as $\beta < 1/2$. Note that the optimal result is also
attained if instead of a random construction of pools we fix a regular-regular
graph which has no loops of length 4 and has the same choices of test and
variable degrees as above. The existence of at least one such graph
for these choices of $K$ and $L$ when $\beta < 1/2$ is guaranteed by the results
in [19]. Thus we have established the result (44) for the optimal value of
tests over all exact two-stage procedures and constructed algorithms based
on regular-regular graphs which attain this optimal value.
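
The degrees of the optimal regular-regular design can be computed directly from $p$. The sketch below is only an illustration: it assumes that the bracket $[\cdot]$ means rounding to the nearest integer (the text does not specify this), and the function names are hypothetical. It also evaluates $P(\vec m)$ of (48) for the regular case, where only $m_K = L$ is non zero, to check that the fraction of non-defective items left undetected is of order $p$.

```python
import math
from typing import Tuple


def regular_regular_degrees(p: float) -> Tuple[int, int]:
    """Variable degree L and test degree K suggested by the asymptotic analysis.

    Assumption: [x] is read as rounding to the nearest integer.
    """
    L = max(1, round(abs(math.log(p)) / math.log(2)))
    K = max(1, round(math.log(2) / p))
    return L, K


def undetected_zero_probability(p: float, L: int, K: int) -> float:
    """Eq. (48) for a regular graph: (1 - (1-p)^(K-1))^L, loops neglected."""
    return (1.0 - (1.0 - p) ** (K - 1)) ** L


if __name__ == "__main__":
    for p in (1e-2, 1e-3, 1e-4):
        L, K = regular_regular_degrees(p)
        frac = undetected_zero_probability(p, L, K)
        print(f"p={p:.0e}  L={L}  K={K}  P(m)/p={frac / p:.3f}")
```

With $\alpha = \log 2$ each pool is negative with probability close to $1/2$, so each of the $L$ tests of an item carries about one bit of information.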

5.2 Testing different pool designs for $p \to 0$

We will now check the performance of different pool designs corresponding to
different random distributions for the pools in the first stage. In all cases we
will fix the degree profiles $\Lambda$ and $P$ and consider a uniform distribution over
graphs with these profiles. Using the notation of section 3 and neglecting
the presence of loops, the mean number of tests (45) can easily be rewritten as

$$\frac{T}{N} = \frac{\sum_r \rho_r/r}{\sum_l \lambda_l/l} + (1-p)\,\Lambda\big[1-\rho[1-p]\big] + p\,\Lambda\Big[1-\rho\big[(1-p)\big(1-\lambda[1-\rho[1-p]]\big)\big]\Big] \tag{50}$$


(we suppose that the fraction of both test and variable nodes of degree
zero is equal to zero). As for the one-stage case, we consider four different
choices of the connectivity distributions corresponding to regular-regular
(R-R), regular-Poisson (R-P), Poisson-Poisson (P-P) and Poisson-regular
(P-R) graphs and for each choice we have optimized over the parameters of
the distribution. The corresponding degree profiles and edge perspectives
are given in sections 4.2 and 4.3. The first term of the r.h.s. of Eq. (50)
corresponds to the total number of tests of the first stage per variable, i.e.
$L/K$, while the second and third terms correspond to $(1-p)(1-S_0)$ and $p(1-S_1)$
respectively, where $S_0$ and $S_1$ have already been evaluated in the previous
section (see e.g. Eqs. (20) and (23)).

We now let $K = \alpha/p$ and $L = c\alpha|\log p| + v$ (in order to keep corrections
in $M$ to the leading term $Np|\log p|$) and we evaluate (50) for the different
pool designs. Then we optimize over the parameters $\alpha$ and $c$.

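5.2.1 Regular-Regular and Regular-Poisson case

If we set $d = c\alpha|\log(1 - \exp(-\alpha))|$, both in the R-R and R-P case we get

$$\frac{T}{N} = cp|\log p| + \frac{vp}{\alpha} + p^{d}\,(1 - \exp(-\alpha))^{v} + o(p^{1+d}). \tag{51}$$

Thus the optimal value for $p \to 0$ is given by $d = 1$ (for $d < 1$ the third term
dominates $p|\log p|$, while $d > 1$ only increases $c$), namely

$$c(\alpha) = \frac{1}{\alpha|\log(1 - \exp(-\alpha))|}.$$

By optimizing over $\alpha$ we get $\alpha = \log 2$ and $c = 1/(\log 2)^2$. Then minimizing
over $v$ we get

$$\frac{T}{Np} = \left(\frac{1}{\log 2}\right)^{2}\left(|\log p| + 1 + 2\log\log 2\right). \tag{52}$$
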
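
The minimization over $v$ is not spelled out in the text. A sketch of the intermediate steps, assuming the three-term form of (51) with $d = 1$ and $\alpha = \log 2$ (so that $1 - \exp(-\alpha) = 1/2$) and dropping the remainder, could read:

```latex
\begin{align*}
\frac{T}{Np} &\simeq \frac{|\log p|}{(\log 2)^2} + \frac{v}{\log 2} + 2^{-v},\\
0 = \frac{\partial}{\partial v}\,\frac{T}{Np} &= \frac{1}{\log 2} - 2^{-v}\log 2
  \;\Longrightarrow\; 2^{-v} = \frac{1}{(\log 2)^2},
  \qquad v = \frac{2\log\log 2}{\log 2},\\
\frac{T}{Np} &\simeq \frac{|\log p|}{(\log 2)^2} + \frac{2\log\log 2}{(\log 2)^2} + \frac{1}{(\log 2)^2}
  = \left(\frac{1}{\log 2}\right)^{2}\bigl(|\log p| + 1 + 2\log\log 2\bigr).
\end{align*}
```

The optimal shift $v$ is negative because $\log\log 2 < 0$, i.e. slightly fewer than $|\log p|/\log 2$ tests per variable are used.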



5.2.2 Poisson-Poisson and Poisson-Regular case

If we set $f = c\alpha\exp(-\alpha)$, for both the P-P and P-R case we get

$$\frac{T}{N} = cp|\log p| + \frac{vp}{\alpha} + p^{f}\exp(-v\exp(-\alpha)) + o(p^{1+f}). \tag{53}$$

Thus the optimal value for $p \to 0$ is given by $f = 1$, namely

$$c(\alpha) = \frac{1}{\alpha\exp(-\alpha)}.$$

By optimizing over $\alpha$ we get $\alpha = 1$ and $c = e$. Then minimizing over $v$ we
get $v = -e$, thus

$$\frac{T}{Np} = e|\log p| + o(p^{f}). \tag{54}$$

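
The two optimization steps of the Poisson case can be checked numerically with a simple grid search. This is only a sketch under the assumption that, at leading order, $T/(Np) \simeq c(\alpha)|\log p| + v/\alpha + \exp(-v\exp(-\alpha))$ with $c(\alpha) = 1/(\alpha\exp(-\alpha))$; the function names and grid resolutions are arbitrary choices made for illustration.

```python
import math


def leading_coefficient(alpha: float) -> float:
    """c(alpha) = 1 / (alpha * exp(-alpha)), the coefficient of |log p|."""
    return 1.0 / (alpha * math.exp(-alpha))


def subleading_term(v: float, alpha: float) -> float:
    """Order-one part of T/(Np): v/alpha + exp(-v * exp(-alpha))."""
    return v / alpha + math.exp(-v * math.exp(-alpha))


def optimize_poisson_design():
    """First minimize c(alpha) on a grid, then the subleading term over v."""
    alphas = [0.01 * k for k in range(1, 500)]
    best_alpha = min(alphas, key=leading_coefficient)
    shifts = [-6.0 + 0.001 * k for k in range(12001)]
    best_v = min(shifts, key=lambda v: subleading_term(v, best_alpha))
    return best_alpha, leading_coefficient(best_alpha), best_v


if __name__ == "__main__":
    alpha, c, v = optimize_poisson_design()
    print(f"alpha ~ {alpha:.2f}, c ~ {c:.4f}, v ~ {v:.3f}")
    print(f"subleading term at the optimum ~ {subleading_term(v, alpha):.2e}")
```

The order-one term vanishes at the optimum, which is why only $e|\log p|$ survives in (54).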
5.3 Optimal algorithms at finite p 

The above results show that both for regular-regular and regular-Poisson
graphs the optimal asymptotic value (44) can be reached in the case $p \to 0$,
while this is true neither in the Poisson-Poisson nor in the Poisson-regular
case. Note however that this does not exclude the existence of other
distributions for which the optimal value is attained. We stress once more
that, even though we did not make any assumption on how $p \to 0$ when
performing the optimization, the results hold only if proper loops can be neglected
in the resulting optimal graphs. This includes the following regimes:
either $p \to 0$ after $N \to \infty$ or $p = N^{-\beta}$ with $\beta < 1/2$. The reason why
we focused on the $p \to 0$ limit is twofold. On the one hand one often
deals in practical applications with problems in which the defective
probability is small. On the other hand the information theoretic lower bound
$T(N,p) \ge Np|\log p|/\log 2$ already tells us that if $p \not\to 0$ the number of tests
is proportional to $N$, as in the trivial procedure which tests all variables
individually. However one could be interested in the optimal random pool
design for the first stage if instead $p$ is held fixed. A natural conjecture in
view of the results of the previous sections is that, at least for sufficiently
small $p$, this corresponds again to a regular-regular graph. In order to solve
this problem one should find the best degree sequences $\Lambda$, $P$ which
minimize the expression (50). This is a hard minimization problem which we
simplified by first proving that (for a general choice of $N$ and $p$) at most
3 coefficients $\Lambda_l$ and at most 5 coefficients $P_r$ are non zero in the optimal
sequence. Plugging this information into a numerical minimization
procedure of (50), we have observed that for most values of $p$ the optimal degree
sequence is the regular-regular one. There are also some values where the
optimal graph is slightly more complicated. For instance for p = .03, the 
best sequences we found are A[x] = and P[x] = .45164 x'^^ + .54836 x^^, 
giving T = .25450 , slightly better than the one obtained with the optimal 
regular-regular one, A[x] = and P[x] = x^^, giving T = .25454 . But 
for all values of $p$ we have explored, we have always found that either the
regular-regular graph is optimal, or the optimal graph has a superposition of
two neighboring degrees of the variables, as in this $p = .03$ case. In any case
regular-regular is always very close to the optimal structure. In Fig. 4 we
depict the expected mean number of tests (divided by the information
theoretic lower bound $NH(p) = -N(p\log_2 p + (1-p)\log_2(1-p))$) obtained by the
numerical minimization of (50) on the ensemble of regular-regular graphs.
In the small $p$ limit the curve goes asymptotically to $1/\log 2$ as predicted by
(44). In Fig. 5 we depict instead the corresponding optimal degree couples
$L$, $K$. Note that the non-analyticity points for the expected mean number
of tests correspond to the values of $p$ where the optimal degree pair $L$, $K$
changes.
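
The minimization over regular-regular graphs at fixed $p$ can be reproduced with a brute-force search over integer degree pairs. The sketch below is an illustration under stated assumptions, not the authors' procedure: it uses the loop-free three-term form $L/K + (1-p)(1-S_0) + p(1-S_1)$ specialized to variable degree $L$ and test degree $K$, and the function names and search ranges are hypothetical.

```python
from typing import Tuple


def mean_tests_per_item(p: float, L: int, K: int) -> float:
    """Loop-free estimate of T/N for a regular-regular first stage."""
    # Probability that all other items of a pool are non-defective.
    q = (1.0 - p) ** (K - 1)
    # A non-defective item is not a sure zero if all its L pools are positive.
    not_sure_zero = (1.0 - q) ** L
    # A neighbouring zero fails to be a sure zero through its other L-1 pools.
    lam = (1.0 - q) ** (L - 1)
    # Probability that all other items of a pool are sure zeros.
    q1 = ((1.0 - p) * (1.0 - lam)) ** (K - 1)
    # A defective item is not a sure one if no pool isolates it.
    not_sure_one = (1.0 - q1) ** L
    return L / K + (1.0 - p) * not_sure_zero + p * not_sure_one


def best_regular_pair(p: float, max_L: int = 12, max_K: int = 200) -> Tuple[float, int, int]:
    """Exhaustive search of the pair (L, K) minimizing the estimate above."""
    candidates = (
        (mean_tests_per_item(p, L, K), L, K)
        for L in range(1, max_L + 1)
        for K in range(2, max_K + 1)
    )
    return min(candidates)


if __name__ == "__main__":
    t, L, K = best_regular_pair(0.03)
    print(f"p=0.03: L={L}, K={K}, T/N={t:.5f}")
```

For $p = .03$ the result can be compared with the regular-regular value of $T$ quoted in the text; jumps of the optimal integer pair as $p$ varies are what produce the non-analyticity points mentioned above.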



6 Perspectives 

As recalled in the introduction, Group Testing strategies are used in a variety
of situations ranging from molecular biology to computer science [1]-[12]. In
most of the applications it is important to take into account the possibility
of errors in the test answers [2,13,29-31], i.e. to consider the faulty case
instead of the gold-standard case analyzed in this work. BP equations for
cavity biases and fields analogous to those of Section 4.3 can be derived
also in the faulty setting and a natural development of the present work is
to analyze the performance of the corresponding BP algorithm. A similar
task has been performed in [13] for a setting relevant for fault diagnosis in
computer networks.

Figure 4: Expected mean number of tests divided by the information
theoretic lower bound $NH(p) = -N(p\log_2 p + (1-p)\log_2(1-p))$ for the
regular-regular graphs which optimize (50). The non-analyticity points correspond
to the values of $p$ where the optimal degree pair $L$, $K$ changes, see Fig. 5. In
the small $p$ limit the curve goes asymptotically to $1/\log 2$ in agreement with
(44).

Figure 5: Values of $L$ (continuous line) and of $\log K$ (dotted line)
corresponding to the couples $L$, $K$ which give the optimal mean number of tests
of Fig. 4. Horizontal axis: $\log(p)$.

It is important to notice that the relevant form of the test errors depends
on the specific application at hand. In the majority of the situations in which
GT is a useful tool, one can assume that the errors occur independently in
different pools. Thus the error model is completely defined by the probabilities
of false positive and false negative answers, which usually are either pool
independent or depend only on the size of the pool. An example of the
latter situation is given by blood screening experiments, for which the false
negative probability increases with the size of the pools due to the inevitable
dilution effect [2,29].

Finally, it is important to bear in mind that, at variance with our analysis,
in practical situations one should take into account finite size corrections
as well as the fact that the maximal size of the pool may be limited by
experimental constraints.